Objective: to gain insights related to the organisation and pre-processing of gridded data as well as to work with time series data.
Data organisation is the practice of classifying and organising data to make it more usable.
It is preferable to save files (corresponding to a time series) with the date (and/or time of day) in the name because:
The alphabetical organisation of files should correspond to the chronological order of the files.
File name consistency (length and style).
We suggest that each file for a certain dataset should have the same base and then a numeric identifier (for date, iteration number, scenario, etc.)
To ensure the same length of file names, months and days from 1 to 9 should be written from 01 to 09
Exporting tables (csv, txt, zoo, etc.).
Same data always in the same location (i.e. in the same row or column)
Make sure that values with no data are not recorded as zeroes!
Sometimes we only want to use particular files for a certain analysis. Therefore, it is useful to learn how to subset data from a directory.
We have already seen that we can list all of the files within a folder:
To select a subset of these files we can use the grep function. This functions looks for a predefined pattern in the files and returns the position of each element.
If the base name of the files does not contain the pattern, we could use the entire path and the parameter value as follows:
grep("2011", ssebop_files, value = TRUE)
## [1] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201101.tif"
## [2] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201102.tif"
## [3] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201103.tif"
## [4] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201104.tif"
## [5] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201105.tif"
## [6] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201106.tif"
## [7] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201107.tif"
## [8] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201108.tif"
## [9] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201109.tif"
## [10] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201110.tif"
## [11] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201111.tif"
## [12] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201112.tif"In the case that we want to use the vector of positions, we have to subset the vector of files as follows:
pos <- grep("2011", basename(ssebop_files))
ssebop_2011 <- ssebop_files[pos]
head(ssebop_2011)
## [1] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201101.tif"
## [2] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201102.tif"
## [3] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201103.tif"
## [4] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201104.tif"
## [5] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201105.tif"
## [6] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201106.tif"This could be done in one line:
What if we only want the files for every March?
ssebop_march <- ssebop_files[grep("03.tif", basename(ssebop_files))]
head(ssebop_march)
## [1] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201003.tif"
## [2] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201103.tif"
## [3] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201203.tif"
## [4] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201303.tif"This works because file names are logical and consistent!
We typically need to pre-process raw data so that it can be used in the way that we require.
Load the CHIRPSv2 precipitation monthly raw data for January, 2000.
Now, we can import the raster using the terra package.
CHIRPSv2 only provides in-land precipitation. NA values (i.e., where no estimate is made) are stored in the raw data as equal to -9999.
Now that the data has been pre-processed, we can save it as a new raster.
Sometimes, we have to aggregate data for analysing different temporal scales. One product can be retrieved at daily scale but monthly or annual estimates may be required. In this example, we want to derive an annual precipitation raster from the monthly CHIRPS rasters for the years 2000-2003.
head(chirps_files)
## [1] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-01.tif"
## [2] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-02.tif"
## [3] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-03.tif"
## [4] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-04.tif"
## [5] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-05.tif"
## [6] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-06.tif"
tail(chirps_files)
## [1] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-07.tif"
## [2] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-08.tif"
## [3] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-09.tif"
## [4] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-10.tif"
## [5] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-11.tif"
## [6] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-12.tif"Now, we want to generate annual raster files by adding the 12 monthly files for each year.
years <- substr(basename(chirps_files), 15, 18)
print(years)
## [1] "2000" "2000" "2000" "2000" "2000" "2000" "2000" "2000" "2000" "2000"
## [11] "2000" "2000" "2001" "2001" "2001" "2001" "2001" "2001" "2001" "2001"
## [21] "2001" "2001" "2001" "2001" "2002" "2002" "2002" "2002" "2002" "2002"
## [31] "2002" "2002" "2002" "2002" "2002" "2002" "2003" "2003" "2003" "2003"
## [41] "2003" "2003" "2003" "2003" "2003" "2003" "2003" "2003"How can we write a for loop to generate one annual raster file at a time?
output_dir <- "C:/User/CHIRPSv2_annual"
for(i in 1:length(unique_years)){
# Here we extract the paths of the year i
positions <- grep(unique_years[i], basename(chirps_files))
files <- chirps_files[positions]
# Then we stack the monthly files of the year i
files <- rast(files)
# Now we have to add the monthly precipitation to get the annual precipitation
annual <- sum(files)
# Finally we can export the file. However, we first have to create a different (and identifiable)
# name for each year. So we will include the year into the new file name.
name <- paste0("CHIRPS_annual_", unique_years[i])
names(annual) <- name
name <- paste0(name, ".tif")
writeRaster(annual, file.path(output_dir, name), overwrite = TRUE)
}If you want to look at how it is working step by step we suggest setting i = 1 (or any other number, and then running each step individually and printing/plotting the results after each step.
i <- 1
positions <- grep(unique_years[i], basename(chirps_files))
print(positions)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
files <- chirps_files[positions]
head(files)
## [1] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-01.tif"
## [2] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-02.tif"
## [3] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-03.tif"
## [4] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-04.tif"
## [5] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-05.tif"
## [6] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-06.tif"files <- rast(files)
print(files)
## class : SpatRaster
## dimensions : 2000, 7200, 12 (nrow, ncol, nlyr)
## resolution : 0.05, 0.05 (x, y)
## extent : -180, 180, -50, 50 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## sources : chirps_monthly2000-01.tif
## chirps_monthly2000-02.tif
## chirps_monthly2000-03.tif
## ... and 9 more source(s)
## names : chirp~00-01, chirp~00-02, chirp~00-03, chirp~00-04, chirp~00-05, chirp~00-06, ...
## min values : 0, 0, 0, 0, 0, 0, ...
## max values : 1251.456, 1467.558, 1222.492, 1224.301, 1322.237, 1584.747, ...
annual <- sum(files)name <- paste0("CHIRPS_annual_", unique_years[i])
names(annual) <- name
name <- paste0(name, ".tif")
print(name)
## [1] "CHIRPS_annual_2000.tif"
writeRaster(annual, file.path(output_dir, name), overwrite = TRUE)Let’s have a look at what is now saved in the CHIRPS_annual folder.
Now, in R:
annual_files <- list.files(output_dir, full.names = TRUE)
print(annual_files)
## [1] "../Data/L4_Data_Processing_Data/CHIRPS_annual/CHIRPS_annual_2000.tif"
## [2] "../Data/L4_Data_Processing_Data/CHIRPS_annual/CHIRPS_annual_2001.tif"
## [3] "../Data/L4_Data_Processing_Data/CHIRPS_annual/CHIRPS_annual_2002.tif"
## [4] "../Data/L4_Data_Processing_Data/CHIRPS_annual/CHIRPS_annual_2003.tif"Let’s plot the annual rasters.
We could also have used the %in% command in the previous example. This command returns the positions of specific values in a vector.
a <- c(10, 10, 20, 20, 30, 30, 40, 40)
a %in% 20
## [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSEIf we use the %in% command in combination with the which command:
for(i in 1:length(unique_years)){
positions <- which(years %in% unique_years[i])
files <- chirps_files[positions]
# Then we stack the monthly files of the year i
files <- rast(files)
# Now we have to add the monthly precipitation to get the annual precipitation
annual <- sum(files)
# Finally we can export the file. However, we first have to create a different (and identifiable)
# name for each year. So we will include the year into the new file name.
name <- paste0("CHIRPS_annual_", unique_years[i])
names(annual) <- name
name <- paste0(name, ".tif")
writeRaster(annual, file.path(output_dir, name), overwrite = TRUE)
}In many instances, we will be working with long time series and will want to extract data over a particular period. We have already looked at how to extract data from particular columns (see example from L2).
Here, we will look at subsetting data according to period and calculating monthly statistics. For this purpose, we will use the hydroTSM package, which is designed to efficiently handle time series. Even though hydroTSM was designed for the field of hydrology, it is an efficient package for managing time series, regardless of the scientific field in which we are working.
Please note that the hydroTSM package loads the zoo package
We will first read the csv file:
head(temp_data)
## Date tmax tmin
## 1 1950-01-01 16.87414245 2.267171503
## 2 1950-01-02 14.12995701 7.472945504
## 3 1950-01-03 12.87256241 -0.322433109
## 4 1950-01-04 12.7848237 -3.291659449
## 5 1950-01-05 15.15733243 -3.763536855
## 6 1950-01-06 20.72999248 -5.898691615
dim(temp_data)
## [1] 25567 3
summary(temp_data)
## Date tmax tmin
## Length:25567 Length:25567 Length:25567
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :characterThe NA values have been saved as this text string in the csv file provided (#N/A). We will reread the csv file, specifying the na.strings parameter to interpret the NAs correctly.
tail(temp_data)
## Date tmax tmin
## 25562 2019-12-26 14.67704 7.608191
## 25563 2019-12-27 NA NA
## 25564 2019-12-28 NA NA
## 25565 2019-12-29 NA NA
## 25566 2019-12-30 15.46794 7.934365
## 25567 2019-12-31 21.43409 10.980799
summary(temp_data)
## Date tmax tmin
## Length:25567 Min. : 7.407 Min. :-7.701
## Class :character 1st Qu.:21.273 1st Qu.: 7.584
## Mode :character Median :25.868 Median :12.026
## Mean :25.874 Mean :11.614
## 3rd Qu.:30.341 3rd Qu.:15.742
## Max. :45.716 Max. :27.670
## NA's :1881 NA's :1843We will use the zoo function to convert the object temp.data to a zoo object.
For our example:
The dates do not correspond to a column of data! There are only two columns (tmax and tmin).
head(temp_data)
## tmax tmin
## 1950-01-01 16.87414 2.2671715
## 1950-01-02 14.12996 7.4729455
## 1950-01-03 12.87256 -0.3224331
## 1950-01-04 12.78482 -3.2916594
## 1950-01-05 15.15733 -3.7635369
## 1950-01-06 20.72999 -5.8986916
summary(temp_data)
## Index tmax tmin
## Min. :1950-01-01 Min. : 7.407 Min. :-7.701
## 1st Qu.:1967-07-02 1st Qu.:21.273 1st Qu.: 7.584
## Median :1984-12-31 Median :25.868 Median :12.026
## Mean :1984-12-31 Mean :25.874 Mean :11.614
## 3rd Qu.:2002-07-01 3rd Qu.:30.341 3rd Qu.:15.742
## Max. :2019-12-31 Max. :45.716 Max. :27.670
## NA's :1881 NA's :1843
dim(temp_data)
## [1] 25567 2To access the dates, we use the index function.
dates <- index(temp_data)
head(dates)
## [1] "1950-01-01" "1950-01-02" "1950-01-03" "1950-01-04" "1950-01-05"
## [6] "1950-01-06"
tail(dates)
## [1] "2019-12-26" "2019-12-27" "2019-12-28" "2019-12-29" "2019-12-30"
## [6] "2019-12-31"
index(temp_data)[3:5]
## [1] "1950-01-03" "1950-01-04" "1950-01-05"We can use the window function to extract data over a particular time period.
temp_1990s <- window(temp_data, start = "1990-01-01", end = "1999-12-31")
summary(temp_1990s)
## Index tmax tmin
## Min. :1990-01-01 Min. :10.59 Min. :-1.821
## 1st Qu.:1992-07-01 1st Qu.:21.99 1st Qu.: 9.007
## Median :1994-12-31 Median :26.34 Median :13.071
## Mean :1994-12-31 Mean :26.41 Mean :12.708
## 3rd Qu.:1997-07-01 3rd Qu.:30.62 3rd Qu.:16.543
## Max. :1999-12-31 Max. :44.71 Max. :26.289
## NA's :195 NA's :180We can use the daily2monthly function to calculate statistics for each month.
x: is the zoo object FUN: function to be applied na.rm: logical. Shoul the NAs be removed?
Note that the hydroTSM package includes conversions to monthly, seasonal and annual values.
For our example, let’s calculate the mean tmax and tmin for each month of the 1990s.
monthly_temp1990s <- daily2monthly(temp_1990s, mean, na.rm = TRUE)
head(monthly_temp1990s)
## tmax tmin
## 1990-01-01 22.13839 6.021942
## 1990-02-01 20.89031 6.032163
## 1990-03-01 22.10266 10.087943
## 1990-04-01 23.71689 13.046162
## 1990-05-01 25.62137 13.678022
## 1990-06-01 30.10750 16.801449
summary(monthly_temp1990s)
## Index tmax tmin
## Min. :1990-01-01 Min. :18.73 Min. : 5.058
## 1st Qu.:1992-06-23 1st Qu.:22.24 1st Qu.: 9.054
## Median :1994-12-16 Median :26.21 Median :12.645
## Mean :1994-12-16 Mean :26.35 Mean :12.649
## 3rd Qu.:1997-06-08 3rd Qu.:30.21 3rd Qu.:16.650
## Max. :1999-12-01 Max. :35.33 Max. :20.311
## NA's :1 NA's :1To visualise the time series stored in a zoo object as a plot:
Now that you have already processed some large data on your computer, let’s have a look at how much space has been used up in writing temporary files.
Go to the following file path in File Manager: C:
If you’re analysing large amounts of data, the temporary files will start taking up a lot of room on your computer. There are a few options:
Periodically delete temporary files from your computer
Tell R to write the raster temporary files to another location (e.g., an external drive). You will still need to delete them later. Note that this is only for the raster files.
To change the directory where the temporary files are stored you can use the rasterOptions function.
Additionally, you could add code to delete temporary files. This can be done with the file.remove function.
Each core of a CPU can perform operations separately from the others. If no special instructions are given, R will only use one computer core. If you need to process very large amounts of data, please look at the following links: